Beta Regression

Team Bee (Beta Regressionists) (Advisor: Dr. Seals)

Anaite Montes Bu, Travis Keep

Introduction

  • Regression analysis is a statistical tool used to explore relationships between variables.

  • Beta regression: a regression model for a dependent variable that is a ratio or proportion, constrained to the open interval (0, 1).

Why not Linear Regression?

  • It can predict values outside the range of 0 to 1.
  • It assumes constant variance, which is not typical for bounded data.
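
The first point can be seen with a small sketch. The numbers below are made up for illustration and do not come from either dataset in this presentation: a straight line fitted to proportions keeps rising past 1 once it is extrapolated.

```r
# Illustrative data only: proportions that rise linearly with x
x <- 1:10
y <- seq(0.05, 0.95, length.out = 10)

# Ordinary linear regression ignores the (0, 1) bound
fit <- lm(y ~ x)

# Predicting just beyond the observed range gives an impossible proportion
pred <- unname(predict(fit, newdata = data.frame(x = 12)))
print(pred > 1)
```

A model with a logit link for the mean, as beta regression uses by default, cannot produce a fitted mean outside (0, 1).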

Key Assumptions

  • Beta distribution: Assumes the outcome follows a beta distribution, which is flexible for variables limited to (0, 1).

  • Precision Parameter (\phi): Allows control over the variance of the outcome, enabling flexibility for data with differing levels of dispersion.

Beta distribution

The PDF of a random variable with a beta distribution is as follows.

f(y) = \begin{cases} \frac{y^{\alpha-1}(1-y)^{\beta-1}}{B(\alpha,\beta)}, & 0 \le y \le 1 \\ 0, & \text{elsewhere} \end{cases}

where B(\alpha,\beta) = \int_0^1 y^{\alpha-1}(1-y)^{\beta-1} \ dy = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha+\beta)}.

\alpha and \beta are the shape parameters, with \alpha > 0 and \beta > 0. [1]
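
As a quick check of the formula, the density can be written out by hand and compared with R's built-in dbeta(). This is only a sketch: the helper name beta_pdf and the shape values 2 and 5 are chosen here for illustration.

```r
# Beta density written directly from the formula, for 0 < y < 1.
# beta_pdf is an illustrative helper, not part of any package.
beta_pdf <- function(y, alpha, beta_shape) {
  y^(alpha - 1) * (1 - y)^(beta_shape - 1) / beta(alpha, beta_shape)
}

y <- c(0.1, 0.5, 0.9)
manual  <- beta_pdf(y, alpha = 2, beta_shape = 5)
builtin <- dbeta(y, shape1 = 2, shape2 = 5)

# The two versions should agree up to floating-point error
print(all.equal(manual, builtin))
```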

Beta Distribution Mean and Variance

\begin{align} E[Y] &= \mu = \frac{\alpha}{\alpha+\beta} \\ V[Y] &= \sigma^2 = \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \end{align} [1]

Introduction of \mu and \phi

For beta regression, it is useful to introduce the following reparameterization:

\begin{align} \mu &= \frac{\alpha}{\alpha+\beta} \\ \phi &= \alpha + \beta \end{align}

\mu is the mean of the beta distribution, and \phi acts as a precision parameter: the higher \phi is, the smaller the variance and the less spread out the PDF. [2]

Revised Beta Distribution

f(y; \mu, \phi) = \frac{\Gamma(\phi)}{\Gamma(\mu \phi) \Gamma((1 - \mu)\phi)} y^{\mu\phi - 1}(1 - y)^{(1 - \mu)\phi - 1}, \quad 0 < y < 1

Where:

  • μ is the mean, ϕ is the precision (for a fixed mean, a larger ϕ gives a smaller variance), and Γ is the gamma function.
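
A minimal sketch of this parameterization in R, assuming only the mapping shape1 = μϕ and shape2 = (1 - μ)ϕ that follows from the definitions of μ and ϕ. The function name dbeta_mu_phi and the values used are illustrative.

```r
# Beta density in the mean/precision parameterization.
# dbeta_mu_phi is an illustrative helper built on base R's dbeta().
dbeta_mu_phi <- function(y, mu, phi) {
  dbeta(y, shape1 = mu * phi, shape2 = (1 - mu) * phi)
}

# Same mean, two precisions: the density at the mean is higher
# when phi is larger, because the mass concentrates around mu
low_phi  <- dbeta_mu_phi(0.3, mu = 0.3, phi = 5)
high_phi <- dbeta_mu_phi(0.3, mu = 0.3, phi = 50)
print(high_phi > low_phi)

# The density still integrates to 1 over (0, 1)
total <- integrate(function(y) dbeta_mu_phi(y, mu = 0.3, phi = 5), 0, 1)$value
```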

Beta Distribution Variance

\text{Var}(Y) = \frac{\mu(1 - \mu)}{1 + \phi}

  • For a fixed \phi, the variance shrinks as \mu approaches either extreme, 0 or 1. [3]

  • Higher values of ϕ correspond to lower variance, indicating that observations are more precise around the mean in beta regression.
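
The variance formula follows from the earlier expression in \alpha and \beta. Substituting \alpha = \mu\phi and \beta = (1-\mu)\phi, which is the inverse of the definitions of \mu and \phi:

```latex
\begin{align}
\text{Var}(Y) &= \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)} \\
              &= \frac{\mu\phi \cdot (1-\mu)\phi}{\phi^2(\phi+1)} \\
              &= \frac{\mu(1-\mu)}{1+\phi}
\end{align}
```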

Extended Beta Regression

Bias Correction/Reduction - Type of Estimator:

  • ML (Maximum Likelihood): Standard method; useful, but its estimates can be biased, particularly in small samples. [4]

  • BC (Bias-Corrected): Adjusts estimates to correct for bias, providing more reliable parameter values.

  • BR (Bias-Reduced): Shrinks estimates towards a central value, which can improve predictive performance.

Extended Beta Regression

Beta Regression Trees

  • This extension uses recursive partitioning to model data that might exhibit subgroup-specific relationships.

  • It builds decision trees by splitting data into different subgroups based on the instability of model parameters across partitioning variables.

Methods - Suicide Rates Dataset

Model Approach:

  • Beta Regression was used to model suicide rates as a function of socio-economic factors, appropriate for data bounded between 0 and 1.

  • Dataset: Suicide Rates Overview 1985 to 2016, with variables like HDI, GDP per capita, sex, age group, and generation.

  • Cleaned data by removing outliers using Cook’s distance and leverage analysis.

  • Managed missing values and calculated descriptive statistics.

Methods Cont’d

Model Details:

  • Incorporated interaction terms and modeled the precision (\phi) as a function of covariates to account for variance differences across groups.

  • Used beta regression trees to capture nonlinear relationships.

Evaluation:

  • Model performance assessed via pseudo R-squared.

  • Software: R for data management and analysis.
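
For reference, a common definition of the pseudo R-squared in beta regression is the one proposed in [2]; it is assumed here that this is the quantity reported in the results. The symbols \hat{\eta}_i, x_i, and \hat{b} are notation introduced for this sketch, with g the logit link:

```latex
R^2_{\text{pseudo}} = \left[ \operatorname{cor}\big(\hat{\eta},\, g(y)\big) \right]^2,
\qquad \hat{\eta}_i = x_i^\top \hat{b},
\qquad g(y_i) = \log\frac{y_i}{1 - y_i}
```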

Methods - Reading Skills Dataset

Model Rationale:

  • ReadingSkills dataset (N=44): Examines reading scores (0.0–1.0) for 44 children, including 19 with dyslexia and 25 without.

  • Beta regression models the response variable within the (0, 1) range, which suits the bounded reading scores better than ordinary linear regression with normally distributed errors.

Methods Cont’d

Data transformation:

  • The response variable is scaled to (0, 1), and its mean is related to the predictors through the logit link function. The precision parameter (ϕ) is modeled on the log scale and may vary with predictors like IQ and dyslexia status.
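
Written out, the two links give the following model for child i. The symbols x_i and z_i (covariate vectors for the mean and the precision) and b and \gamma (their coefficient vectors) are notation introduced for this sketch; b is used instead of \beta to avoid a clash with the shape parameter.

```latex
\begin{align}
\operatorname{logit}(\mu_i) = \log\frac{\mu_i}{1-\mu_i} &= x_i^\top b \\
\log(\phi_i) &= z_i^\top \gamma
\end{align}
```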

Analysis - Suicide Rates Dataset

Extended Beta Regression

Beta Regression Base Model:

library(betareg)

# Mean model only; the precision (phi) is held constant
betareg(
  formula = suicide_rate ~ HDI_year + GDP_capita + sex + age + generation,
  data = suicide_dataset
)

Bias Corrected (BC) Model:

# Terms after "|" model the precision (phi); type = "BC" requests bias correction
betareg(
  formula = suicide_rate ~ HDI_year + GDP_capita + sex + age + generation | HDI_year + GDP_capita,
  data = suicide_dataset,
  type = "BC"
)

Analysis Cont’d

Extended Beta Regression

Beta Regression Trees:

The beta regression tree shows HDI_year as a key predictor. Specific HDI_year thresholds split the data into groups in which higher HDI_year values link to better outcomes, and smaller nodes show more variability.

Analysis Cont’d - Beta Regression Tree

Figure 1. Beta Regression Tree for Predicting Suicide Rate Based on HDI

Analysis Cont’d

Model Diagnostics

The betareg package supports both fixed-dispersion and variable-dispersion beta regression [5].

Analysis Cont’d

Model Diagnostics

The cleaned model shows improved fit, with more random residuals, fewer outliers, and reduced data point influence.

Analysis - Reading Skills Dataset

Regressors

  • IQ (Z-score)
    • Min -1.745
    • Median -0.122
    • Max 1.856
  • Dyslexia
    • Yes
    • No

Analysis Cont’d

Dataset Tweaking

  • Dyslexia
    • No -> 0.0
    • Yes -> 1.0
  • Reading Score
    • 1.0 -> 0.99

Remember: the dependent variable must lie in the open interval (0, 1), so a score of exactly 1.0 has to be moved inside it.
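
The recoding can be sketched in base R as follows. The small data frame raw is a made-up stand-in for the original data; the names dcode, iq, accuracy, and ReadingSkillsModel follow the model calls in this presentation.

```r
# Hypothetical stand-in for the raw data (values invented for illustration)
raw <- data.frame(
  accuracy = c(0.62, 1.00, 0.88, 1.00),
  dyslexia = c("yes", "no", "no", "yes"),
  iq       = c(-0.5, 1.2, 0.3, -1.1)
)

ReadingSkillsModel <- data.frame(
  # Scores of exactly 1.0 fall outside (0, 1), so replace them with 0.99
  accuracy = ifelse(raw$accuracy == 1.0, 0.99, raw$accuracy),
  # Indicator coding: no -> 0.0, yes -> 1.0
  dcode    = ifelse(raw$dyslexia == "yes", 1.0, 0.0),
  iq       = raw$iq
)

# Every response value now lies strictly inside (0, 1)
print(all(ReadingSkillsModel$accuracy > 0 & ReadingSkillsModel$accuracy < 1))
```

With 0/1 coding, the dyslexia coefficient is read as the difference between dyslexic and non-dyslexic children at an IQ Z-score of 0.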

Analysis Cont’d

Beta Regression Fitting: Bias Corrected (BC)

# Mean: dyslexia, IQ, and their interaction; precision (after "|"): dyslexia and IQ
betareg(
  formula = accuracy ~ dcode * iq | dcode + iq,
  data = ReadingSkillsModel,
  type = "BC"
)

Generalized Linear Model (Gaussian family, logit link)

# Gaussian GLM with a logit link, fitted for comparison
glm(
  formula = accuracy ~ dcode * iq,
  family = gaussian(link = "logit"),
  data = ReadingSkillsModel
)

The logit function, \operatorname{logit}(p) = \log\frac{p}{1-p}, maps (0, 1) to \mathbb{R}.
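
In base R the logit and its inverse are available as qlogis() and plogis(). A short sketch with illustrative values:

```r
# Proportions close to 0, in the middle, and close to 1
p <- c(0.01, 0.5, 0.99)

# Logit: log(p / (1 - p)), unbounded on the real line
eta <- qlogis(p)

# Inverse logit brings any real number back into (0, 1)
back <- plogis(eta)

print(all.equal(back, p))
```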

Analysis Cont’d

Results for Normal Children

Analysis Cont’d

Results for Dyslexic Children

Results - Suicide Dataset

Table 2. Impact of Socioeconomic Factors on Suicide Rates: Base and Bias-Corrected Models (Part 1)

                         Beta Regression Base Model           Bias-Corrected Beta Regression
Variable                 Beta (SE)                p-value     Beta (SE)             p-value
Intercept                -5.75 (0.121)            < 2e-16     -5.49 (1.53)          < 2e-16
HDI_year                 3.60 (0.160)             < 2e-16     3.31 (2.03)           < 2e-16
GDP_capita               -6.4e-06 (6.44e-07)      < 2e-16     -8.7e-06 (7.3e-07)    < 2e-16
Sex (Male)               0.81 (0.019)             < 2e-16     0.82 (0.018)          < 2e-16
Age 25-34 years          0.084 (0.033)            0.0099      0.090 (0.032)         0.0053
Age 35-54 years          0.070 (0.040)            0.080       0.086 (0.040)         0.030
Age 5-14 years           -0.96 (0.046)            < 2e-16     -0.96 (0.046)         < 2e-16
Age 55-74 years          -0.16 (0.053)            0.0030      -0.13 (0.053)         0.011
Age 75+ years            -0.21 (0.060)            0.00047     -0.18 (0.060)         0.0024
G.I. Generation          0.51 (0.048)             < 2e-16     0.49 (0.048)          < 2e-16
Generation X             -0.22 (0.037)            4.66e-09    -0.21 (0.037)         9.60e-09
Generation Z             -0.49 (0.068)            3.03e-13    -0.56 (0.067)         < 2e-16
Generation Millennials   -0.42 (0.046)            < 2e-16     -0.43 (0.046)         < 2e-16

Results Cont’d

                         Beta Regression Base Model           Bias-Corrected Beta Regression
Variable                 Beta (SE)                p-value     Beta (SE)             p-value
Generation Silent        0.098 (0.037)            0.0086      0.091 (0.037)         0.013
Precision Model
HDI_year                                                      0.64 (0.289)          0.026
GDP_capita                                                    8.31e-06 (1.15e-06)   4.75e-13
Model Fit
Pseudo R-squared         0.4623                               0.4625

Results Cont’d

  • Lower HDI values (≤ 0.661) are associated with higher variability in suicide rates, indicating socio-economic instability.

  • Higher HDI values (> 0.759) show lower, more stable suicide rates, suggesting better socio-economic conditions reduce suicide risk.

  • Initially, higher HDI leads to lower suicide rates, but at higher HDI levels, suicide rates slightly increase, highlighting complex influences beyond economic factors.

Results Cont’d

Model Performance

  • Base model: HDI, GDP, sex, age, and generation are significant predictors of suicide rates.

  • Extended model (with varying precision): Slightly improved fit, capturing more complexity in socio-economic factors influencing suicide rates.

Results - Reading Skills Dataset

Table 2: Association of Reading Skills Score with IQ and presence of Dyslexia

                Beta Regression                   Generalized Linear Model
Variable        β         SE        p             β         SE        p
Dyslexia        -1.446    0.2954    9.767e-07     -1.598    0.2448    8.565e-08
IQ (Z-score)    1.049     0.2718    0.0001132     0.4851    0.2916    0.104
Dyslexia:IQ     -1.144    0.2768    3.593e-05     -0.5463   0.3145    0.09001


Results Cont’d

Dyslexia’s effect on scores

A child’s odds of answering a reading skills question correctly decrease by a factor of e^{1.446} if the child is dyslexic, assuming average IQ (a Z-score of 0).

Results Cont’d

IQ’s effect on scores

  • If a non-dyslexic child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly increase by a factor of e^{1.049}.

  • If a dyslexic child’s IQ increases by 1 standard deviation, their odds of answering a reading skills question correctly decrease by a factor of e^{0.095}.

The exponent for dyslexic children is the IQ coefficient plus the interaction coefficient: 1.049 - 1.144 = -0.095.
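
These factors can be traced back to the fitted mean model. The sketch below uses the beta regression coefficients from the table; the labels b_0 to b_3 are notation introduced here, and the intercept b_0 is left symbolic because it is not reported.

```latex
\begin{align}
\operatorname{logit}(\mu) &= b_0 + b_1\,\text{dcode} + b_2\,\text{iq} + b_3\,(\text{dcode} \times \text{iq}) \\
\text{Dyslexia at iq} = 0 &: \quad e^{b_1} = e^{-1.446} \approx 0.24 \quad \text{(odds divided by } e^{1.446} \approx 4.25\text{)} \\
\text{IQ, dcode} = 0 &: \quad e^{b_2} = e^{1.049} \approx 2.85 \\
\text{IQ, dcode} = 1 &: \quad e^{b_2 + b_3} = e^{-0.095} \approx 0.91 \quad \text{(odds divided by } e^{0.095} \approx 1.10\text{)}
\end{align}
```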

Conclusion

  • Beta regression is effective for proportion data and well suited to outcomes bounded in the (0, 1) range.
  • Models both mean and precision, managing boundary cases and latent heterogeneity.
  • Bias correction and beta regression trees expand its capabilities.
  • The betareg package in R offers a powerful, flexible framework for analysts.

References

[1]
D. D. Wackerly, Mathematical statistics with applications, 6th ed. Duxbury Press, 2002.
[2]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004, doi: 10.1080/0266476042000214501.
[3]
S. Ferrari and F. Cribari-Neto, “Beta regression for modelling rates and proportions,” J. Appl. Stat., vol. 31, no. 7, pp. 799–815, Aug. 2004.
[4]
B. Grün, I. Kosmidis, and A. Zeileis, “Extended beta regression in R: Shaken, stirred, mixed, and partitioned,” J. Stat. Softw., vol. 48, no. 11, 2012.
[5]
A. Zeileis, F. Cribari-Neto, B. Grün, and I. Kosmidis, “Betareg: Beta regression.” The R Foundation, Apr. 2004.